Harden RAG PDF ingestion #8

Open
kkudumu wants to merge 5 commits into aietal:master from kkudumu:codex/rag-ingest-hardening

kkudumu commented May 14, 2026

Part of the open Algora bounty for [ISAAC-497] Implement an enhanced RAG Pipeline for Scientific/Research Workflows.

/claim #45

Bounty reference: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu

Summary

  • move PDF chunk metadata preparation into a tested server helper
  • make ingestion tolerate missing PDF title/source/page metadata instead of throwing while processing uploaded research PDFs
  • skip blank chunks before writing to Chroma so retrieval does not surface empty context
  • return a clear 400 response when the upload request does not include a PDF
  • remove full-document console logging from the upload path

Why this helps the scientific RAG bounty

Scientific PDFs often have incomplete or inconsistent parser metadata. The current upload path assumes metadata.pdf.info.Title, metadata.source, and metadata.loc.pageNumber are always present, so a single malformed parsed chunk can crash ingestion before the RAG pipeline can retrieve anything. This PR is a focused ingestion reliability slice that complements the existing citation/reranking/context PRs.
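
As a rough illustration of the tolerance described above, here is a minimal sketch of metadata normalization with blank-chunk filtering. The names `ParsedChunk`, `ChunkMetadata`, `normalizeMetadata`, and `prepareChunks` are hypothetical and are not taken from this PR's diff; the empty-string and zero defaults for title, source, and page are likewise an assumption.

```typescript
// Hypothetical shape of a parsed PDF chunk; every metadata field may be missing.
interface ParsedChunk {
  pageContent: string;
  metadata?: {
    source?: string;
    pdf?: { info?: { Title?: string } };
    loc?: { pageNumber?: number };
  };
}

// Flat, primitive-only metadata so the values are safe to store in Chroma.
interface ChunkMetadata {
  title: string;
  source: string;
  page: number;
}

// Read each field with optional chaining and fall back to an assumed default
// instead of throwing when the parser left it out.
function normalizeMetadata(chunk: ParsedChunk): ChunkMetadata {
  return {
    title: chunk.metadata?.pdf?.info?.Title ?? "",
    source: chunk.metadata?.source ?? "",
    page: chunk.metadata?.loc?.pageNumber ?? 0,
  };
}

// Drop whitespace-only chunks, then pair the remaining text with its metadata.
function prepareChunks(chunks: ParsedChunk[]): {
  documents: string[];
  metadatas: ChunkMetadata[];
} {
  const kept = chunks.filter((c) => c.pageContent.trim().length > 0);
  return {
    documents: kept.map((c) => c.pageContent),
    metadatas: kept.map(normalizeMetadata),
  };
}
```

Keeping this logic in a pure function, separate from the upload route, is what makes it straightforward to unit test without a running Chroma instance.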

Demo

Verification

From ui/:

npx vitest run __tests__/rag-ingest.test.ts
npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts
npx tsc --noEmit --pretty false
npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts
git diff --check

Results:

  • Targeted Vitest suite passed: 4 tests
  • Prettier check passed
  • TypeScript check passed
  • ESLint passed
  • git diff --check passed

AI-Assisted Disclosure

This contribution was produced with AI assistance and manually reviewed/verified before submission.

kkudumu commented May 14, 2026

Hi maintainers / @algora-pbc, quick visibility note for the Isaac/AimenGPT RAG bounty: this PR targets a distinct ingestion-hardening gap in the scientific RAG flow rather than duplicating the citation/reranking/context-budgeting PRs already open.

Bounty reference: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu

Current verification from ui/:

  • npx vitest run __tests__/rag-ingest.test.ts -> 4 passed
  • npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts -> passed
  • npx tsc --noEmit --pretty false -> passed
  • npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts -> passed
  • git diff --check -> passed

The main behavior change is making uploaded research PDFs with incomplete parser metadata ingest reliably instead of crashing before retrieval can happen.

kkudumu commented May 14, 2026

Follow-up pushed in 94474be to handle one more ingestion edge case: if a PDF parses into only blank chunks, the upload route now returns a clear 400 instead of calling Chroma with empty arrays.
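
The two 400 cases mentioned in this PR (no PDF in the request, and a PDF that yields only blank chunks) can be sketched as a framework-free decision function. This is an illustrative assumption, not the route's actual code: `IngestDecision`, `decideIngest`, and the error strings are hypothetical.

```typescript
// Hypothetical result type: either an HTTP error to send, or documents to write.
type IngestDecision =
  | { ok: false; status: 400; error: string }
  | { ok: true; documents: string[] };

function decideIngest(hasPdf: boolean, chunkTexts: string[]): IngestDecision {
  // Case 1: the upload request did not include a PDF at all.
  if (!hasPdf) {
    return { ok: false, status: 400, error: "Request must include a PDF file." };
  }

  // Case 2: the PDF parsed, but every chunk is empty or whitespace.
  const documents = chunkTexts.filter((t) => t.trim().length > 0);
  if (documents.length === 0) {
    return { ok: false, status: 400, error: "PDF contained no extractable text." };
  }

  // Only now is it safe to hand non-empty arrays to the vector store.
  return { ok: true, documents };
}
```

A route handler could call this before any Chroma write and, on `ok: false`, respond with the given status and message.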

Verification after the update from ui/:

  • npx vitest run __tests__/rag-ingest.test.ts -> 5 passed
  • npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts -> passed
  • npx tsc --noEmit --pretty false -> passed
  • npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts -> passed
  • git diff --check -> passed

kkudumu commented May 14, 2026

Follow-up pushed in 0bc1bd2 to preserve citation metadata in scientific RAG chunks.

What changed:

  • Extracts the first DOI from chunk text and stores it as primitive Chroma metadata.
  • Extracts the first publication year and defaults to 0 when absent.
  • Adds a stable 16-character sourceHash derived from source path and page number so retrieved chunks can be grouped back to the originating source/page without leaking document content into IDs.
  • Keeps missing values Chroma-compatible with empty string / zero defaults.
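
One way the citation fields listed above could be derived is sketched below. The regular expressions, the choice of SHA-256 truncated to 16 hex characters, and the names `CitationMetadata` and `extractCitationMetadata` are all assumptions made for illustration; the commit may use different patterns or a different hash.

```typescript
import { createHash } from "node:crypto";

// Primitive-only fields, with "" and 0 standing in for missing values.
interface CitationMetadata {
  doi: string;
  year: number;
  sourceHash: string;
}

// Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"<>]+/;
// Assumed pattern: a standalone four-digit year starting with 19 or 20.
const YEAR_PATTERN = /\b(?:19|20)\d{2}\b/;

function extractCitationMetadata(
  text: string,
  source: string,
  page: number,
): CitationMetadata {
  // First DOI in the chunk, minus trailing sentence punctuation.
  const doiMatch = text.match(DOI_PATTERN);
  const doi = doiMatch ? doiMatch[0].replace(/[.,;:)\]]+$/, "") : "";

  // First plausible publication year, or 0 when none is found.
  const yearMatch = text.match(YEAR_PATTERN);
  const year = yearMatch ? Number(yearMatch[0]) : 0;

  // Hash only the source path and page number, never the chunk text,
  // so the identifier is stable and carries no document content.
  const sourceHash = createHash("sha256")
    .update(`${source}:${page}`)
    .digest("hex")
    .slice(0, 16);

  return { doi, year, sourceHash };
}
```

Because the hash input excludes the chunk text, two chunks from the same source and page share a `sourceHash`, which is what allows retrieved chunks to be grouped back to their origin.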

Verification:

  • npx vitest run __tests__/rag-ingest.test.ts
  • npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts
  • npx tsc --noEmit --pretty false
  • npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts
  • git diff --check

kkudumu commented May 14, 2026

Added a short demo video artifact for Algora/reviewer convenience:

This is supplemental evidence for review; the code and tests remain the source of truth.

kkudumu commented May 14, 2026

Updated the existing demo video with narrated voiceover explaining the ingestion-hardening changes, safer PDF handling, blank-chunk filtering, citation metadata preservation, and test coverage. The PR's existing demo-video link now points to the narrated MP4.
